A resource-saving collective approach to biomedical semantic role labeling
BACKGROUND: Biomedical semantic role labeling (BioSRL) is a natural language processing technique that identifies the semantic roles of the words or phrases in sentences describing biological processes and expresses them as predicate-argument structures (PASs). Currently, a major problem in BioSRL is that most systems label every node in a full parse tree independently, even though some nodes always exhibit dependency. In general SRL, collective approaches based on the Markov logic network (MLN) model have successfully dealt with this problem. In BioSRL, however, such an approach has not been attempted, because it would require more training data to recognize the more specialized and diverse terms found in biomedical literature, increasing training time and computational complexity.

RESULTS: We first constructed a collective BioSRL system based on MLN. This system, called collective BIOSMILE (CBIOSMILE), is trained on the BioProp corpus. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes; nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, referred to as resource-saving collective BIOSMILE (RCBIOSMILE). Our experimental results show that CBIOSMILE outperforms BIOSMILE, the top BioSRL system. Furthermore, RCBIOSMILE matches the accuracy of CBIOSMILE while using 92% less memory and 57% less training time.

CONCLUSIONS: This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We therefore consider it of primary importance to pursue SRL training on large corpora in the future.
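The resource-saving step described above (a tree-pruning filter followed by candidate identifiers, with unrecognized nodes discarded) can be sketched in outline. The node structure, filter, and identifier functions below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the pruning pipeline described in the abstract:
# keep a parse-tree node only if it passes the tree-pruning filter
# and is accepted by at least one argument candidate identifier.
# Node layout and identifier logic are illustrative assumptions.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def prune(node, passes_filter, identifiers):
    """Return a pruned copy of the tree, dropping nodes that fail the
    filter or are rejected by every candidate identifier."""
    kept_children = [
        pruned for child in node.children
        if (pruned := prune(child, passes_filter, identifiers)) is not None
    ]
    keep = passes_filter(node) and any(ident(node) for ident in identifiers)
    if keep or kept_children:
        return Node(node.label, kept_children)
    return None
```

A node whose subtree is entirely rejected disappears from the training data, which is what shrinks the memory footprint of the downstream MLN training.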
Rule-based Korean Grapheme to Phoneme Conversion Using Sound Patterns
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
WikiSense: Supersense Tagging of Wikipedia Named Entities Based on WordNet
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
Modeling the Relationship among Linguistic Typological Features with Hierarchical Dirichlet Process
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills
While large language models have made strides in natural language processing, their proficiency in complex reasoning tasks that require formal language comprehension, such as chess, remains less investigated. This paper probes the performance of ChatGPT, a sophisticated language model by OpenAI, on such complex reasoning tasks, using chess as a case study. Through robust metrics examining both the legality and quality of moves, we assess ChatGPT's understanding of the chessboard, adherence to chess rules, and strategic decision-making abilities. Our evaluation identifies limitations within ChatGPT's attention mechanism that affect its formal language comprehension and uncovers the model's underdeveloped self-regulation abilities. Our study also reveals ChatGPT's propensity for a coherent strategy in its gameplay and a noticeable uptick in decision-making assertiveness when the model is presented with a greater volume of natural language or possesses a more lucid understanding of the state of the chessboard. These findings contribute to the growing exploration of language models' abilities beyond natural language processing, providing valuable information for future research towards models demonstrating human-like cognitive abilities.
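A move-legality metric of the kind described above can be sketched abstractly. The function below is a hypothetical illustration: the legality predicate is injected as a callable, whereas a real evaluation would back it with a full chess rules engine, which the abstract does not name.

```python
# Hedged sketch of a legal-move-rate metric for evaluating a model's
# chess output. The legality check is passed in as a callable so the
# metric itself stays self-contained.

def evaluate_moves(moves, is_legal):
    """Return (legal_rate, first_illegal_index) for a proposed move
    sequence; first_illegal_index is None if every move is legal."""
    legal, first_bad = 0, None
    for i, move in enumerate(moves):
        if is_legal(move):
            legal += 1
        elif first_bad is None:
            first_bad = i
    rate = legal / len(moves) if moves else 0.0
    return rate, first_bad
```

For example, with a toy predicate that rejects any move mentioning a ninth rank, `evaluate_moves(["e4", "Nf3", "Qh9", "d4"], lambda m: "9" not in m)` yields a legal rate of 0.75 with the first illegal move at index 2.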
Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien
In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and the lack of an official writing system, limiting the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset that mitigates these limitations, addresses the morphological issues arising within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train XLM (the cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality. Comment: This paper was accepted by Findings of EMNLP 2022.
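The word segmentation step mentioned above can be illustrated with a simple forward maximum-matching baseline over a language-tagged lexicon. The toy lexicon entries and tag names below are hypothetical and do not come from the paper's toolkit or corpus.

```python
# Hedged sketch: forward maximum matching over a lexicon whose
# entries carry language tags, a common baseline for segmenting
# code-mixed Chinese-character text. Lexicon contents are toy
# examples, not data from the paper.

def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; characters not covered by
    the lexicon become single-character tokens tagged 'UNK'."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in lexicon:
                out.append((piece, lexicon[piece]))
                i += length
                break
        else:
            out.append((text[i], "UNK"))
            i += 1
    return out
```

The language tags attached to each token are what a downstream code-mixing model (such as the adapted XLM described above) could consume alongside the text.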
Korean-Chinese Person Name Translation for Cross Language Information Retrieval
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 2007